BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 #12403

multiloc · 2016-02-20T23:06:11Z

closes Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry
concat to handle tz-aware all NaT/mixed cases
added unit tests

jreback · 2016-02-21T16:42:15Z

pandas/core/internals.py

@@ -4802,6 +4802,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

            if self.is_null and not getattr(self.block, 'is_categorical',
                                            None):
+                if isinstance(empty_dtype, DatetimeTZDtype):
+                    # handle timezone-aware all NaT cases


see if u can make this if else flow s bit more

what i mean is make this an else-if so this cleanly handles the different dtypes (categorical, datetime with tz) and is obvious what is going on

multiloc · 2016-02-21T17:05:05Z

Could do

if self.is_null and not getattr(self.block, 'is_categorical',
                                None):
    num_elements = np.prod(self.shape)
    if isinstance(empty_dtype, DatetimeTZDtype):
        # handle timezone-aware all NaT cases
        return DatetimeIndex([fill_value] * num_elements,
                             dtype=empty_dtype)
    else:
        missing_arr = np.empty(self.shape, dtype=empty_dtype)
        if num_elements:
            # ...
            missing_arr.fill(fill_value)
        return missing_arr

jreback · 2016-02-22T14:02:14Z

pandas/core/internals.py

@@ -4802,6 +4802,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):

            if self.is_null and not getattr(self.block, 'is_categorical',
                                            None):
+                if isinstance(empty_dtype, DatetimeTZDtype):


you can also blow away the numpy 1.6 workaround comment / code.

jreback · 2016-02-22T14:05:20Z

@hcontrast see what you can do here. This involves cleaning up some code and making it more generic to handle the extension dtypes. lmk if you get stuck.

multiloc · 2016-02-24T11:20:47Z

Ok. Will have to wait until the weekend though

jreback · 2016-03-23T17:08:42Z

I modified some things in #12702

pls update/rebase when that's merged

multiloc · 2016-03-23T21:35:31Z

Rebased on top of master. Having trouble running tests locally though

jreback · 2016-03-23T22:05:36Z

pandas/tools/tests/test_merge.py

+        dtype = DatetimeTZDtype('ns', tz='UTC')
+        first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)
+
+        result = pd.concat([first, second], 0)


always write out parameters axis=0 if you are passing them.

jreback · 2016-04-18T17:34:15Z

sorry, meant here. can you rebase/update?

multiloc · 2016-04-18T19:47:48Z

Updated the tests, but the mixed tz-/non-tz-aware cases are now breaking again:

======================================================================
ERROR: test_concat_NaT_dataframes_all_NaT_axis_0 (pandas.tools.tests.test_merge.TestConcatenate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../dev/pandas/pandas/tools/tests/test_merge.py", line 2606, in test_concat_NaT_dataframes_all_NaT_axis_0
    result = pd.concat([first, second], axis=0)
  File "/.../dev/pandas/pandas/tools/merge.py", line 836, in concat
    return op.get_result()
  File "/.../dev/pandas/pandas/tools/merge.py", line 1024, in get_result
    concat_axis=self.axis, copy=self.copy)
  File "/.../dev/pandas/pandas/core/internals.py", line 4544, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "/.../dev/pandas/pandas/core/internals.py", line 4641, in concatenate_join_units
    for ju in join_units]
  File "/.../dev/pandas/pandas/core/internals.py", line 4907, in get_reindexed_values
    missing_arr = np.empty(self.shape, dtype=empty_dtype)
TypeError: data type not understood

where

# pandas/tools/tests/test_merge.py:2601

# one side timezone-aware
dtype = DatetimeTZDtype('ns', tz='UTC')
first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)

result = pd.concat([first, second], axis=0)

Looks like accidental/eager upcasting on closer inspection

block:  DatetimeTZBlock: slice(0, 1, 1), 1 x 2, dtype: datetime64[ns, UTC]
is_datetime_tz:  True
empty dtype:  datetime64[ns, UTC]
block:  DatetimeBlock: slice(0, 1, 1), 1 x 1, dtype: datetime64[ns]
is_datetime_tz:  False
empty dtype:  datetime64[ns, UTC]

The second block is regular datetime but gets passed a tz-aware empty value, hence triggering the wrong case.

jreback · 2016-04-18T20:00:02Z

ok I just pushed a change where we moved all of the concat stuff together (now in pandas/types/concat.py, was mostly from core/common.py).

use this to handle the path

com.is_extension_type(empty_dtype)
True

multiloc · 2016-04-18T20:12:50Z

Thanks. I've applied that fix, only outstanding issue is the mixed Series-DF case, which doesn't preserve the dtype information correctly:

FAIL: test_concat_NaT_series_dataframe_all_NaT (pandas.tools.tests.test_merge.TestConcatenate)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/.../dev/pandas/pandas/tools/tests/test_merge.py", line 2696, in test_concat_NaT_series_dataframe_all_NaT
    assert_frame_equal(result, expect, check_dtype=True)
  File "/.../dev/pandas/pandas/util/testing.py", line 1166, in assert_frame_equal
    obj='DataFrame.iloc[:, {0}]'.format(i))
  File "/.../dev/pandas/pandas/util/testing.py", line 1026, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "/.../dev/pandas/pandas/util/testing.py", line 804, in assert_attr_equal
    left_attr, right_attr)
  File "/.../dev/pandas/pandas/util/testing.py", line 915, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  object
[right]: datetime64[ns]

    def test_concat_NaT_series_dataframe_all_NaT(self):
        # GH 12396

        # non-timezone aware
        first = pd.Series([pd.NaT, pd.NaT])
        second = pd.DataFrame([[pd.Timestamp('2015/01/01')],
                               [pd.Timestamp('2016/01/01')]])

        expect = pd.DataFrame([pd.NaT, pd.NaT, second.iloc[0, 0],
                               second.iloc[1, 0]], index=[0, 1, 0, 1])

        result = pd.concat([first, second])
        assert_frame_equal(result, expect)

        # one side timezone-aware
        dtype = DatetimeTZDtype('ns', tz='UTC')
        first = first.astype(dtype)
        result = pd.concat([first, second])
        assert_frame_equal(result, expect, check_dtype=True)
        self.assertEqual(result.dtypes.iloc[0], second.dtypes[0])

        # both sides timezone-aware
        second = second.apply(lambda x: x.astype(dtype))

        result = pd.concat([first, second])
        assert_frame_equal(result, expect, check_dtype=True)
        self.assertEqual(result.dtypes.iloc[0], second.dtypes[0])

jreback · 2016-04-18T20:18:22Z

that last already seems to work

multiloc · 2016-04-18T21:10:58Z

Might be due to the patch then

jreback · 2016-05-08T18:23:17Z

pandas/core/internals.py

@@ -4911,6 +4911,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na):
                    pass
                elif getattr(self.block, 'is_sparse', False):
                    pass
+                elif com.is_extension_type(empty_dtype):
+                    num_elements = np.prod(self.shape)
+                    # handle timezone-aware all NaT cases


just do this if it's a datetime tz type

jreback · 2016-05-08T19:26:21Z

pandas/core/internals.py

+                elif com.is_extension_type(empty_dtype) and \
+                        com.is_datetimetz(empty_dtype):
+                    num_elements = np.prod(self.shape)
+                    # handle timezone-aware all NaT cases


no, do this under the if getattr(self.block, 'is_datetimetz', False) or com.is_datetimetz(empty_dtype)

Like so? This passes the tests but not sure how good the coverage is for the non-NaT tz-aware cases.

if getattr(self.block, 'is_datetimetz', False) \ or com.is_datetimetz(empty_dtype): num_elements = np.prod(self.shape) # handle timezone-aware all NaT cases return DatetimeIndex([fill_value] * num_elements, dtype=empty_dtype) elif getattr(self.block, 'is_categorical', False): pass elif getattr(self.block, 'is_sparse', False): pass

yes more like this. yes coverage of extension types is prob only so now

codecov-io · 2016-05-08T20:08:35Z

Current coverage is 84.14%

Merging #12403 into master will increase coverage by +<.01%

@@             master     #12403   diff @@
==========================================
  Files           137        137          
  Lines         50287      50288     +1   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42311      42313     +2   
+ Misses         7976       7975     -1   
  Partials          0          0

File ...das/tseries/index.py (not in diff) was modified. more
- Misses -1
- Partials 0
- Hits +1

Powered by Codecov. Last updated by af7bdd3...10406df

jreback · 2016-05-20T14:24:13Z

ok (this is the original issue)

In [1]: df1 = pd.DataFrame([[pd.NaT], [pd.NaT]])

In [2]: df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')], [pd.Timestamp('2016/01/01', tz='UTC')]])

In [4]: pd.concat([df1, df2])
Out[4]: 
                          0
0                       NaT
1                       NaT
0 2015-01-01 00:00:00+00:00
1 2016-01-01 00:00:00+00:00

In [5]: pd.concat([df1, df2]).dtypes
Out[5]: 
0    datetime64[ns, UTC]
dtype: object

This is incorrect; we are coercing to object (correct), but the NaT are getting incorrectly corced to nan here.

In [6]: df2 = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')], [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

In [7]: pd.concat([df1, df2]).dtypes
Out[7]: 
0    object
dtype: object

In [8]: pd.concat([df1, df2])
Out[8]: 
                           0
0                        NaN
1                        NaN
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00

jreback · 2016-05-31T15:21:04Z

@hcontrast can you update. only small fix I think.

multiloc · 2016-05-31T20:03:31Z

Added test case for this, indeed still breaking

multiloc · 2016-05-31T20:17:05Z

Can't actually get the test to break. This should break but doesn't

        first = pd.DataFrame([[pd.NaT], [pd.NaT]])
        second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')],
                               [pd.Timestamp('2016/01/01', tz='US/Eastern')]])

        expect = pd.DataFrame([pd.NaT, pd.NaT, second.iloc[0, 0],
                               second.iloc[1, 0]], index=[0, 1, 0, 1])

        result = pd.concat([first, second], axis=0)
        assert_frame_equal(result, expect)
        print expect
        print result

gives

                           0
0                        NaT
1                        NaT
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00
                           0
0                        NaN
1                        NaN
0  2015-01-01 00:00:00+00:00
1  2016-01-01 00:00:00-05:00

jreback · 2016-05-31T20:26:34Z

This is a tricky one; we have this flag to test exactness (so do this for now).

In [27]: tm.assert_frame_equal(result, expect, check_datetimelike_compat=True)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-27-441a8feab35d> in <module>()
----> 1 tm.assert_frame_equal(result, expect, check_datetimelike_compat=True)

/Users/jreback/pandas/pandas/util/testing.pyc in assert_frame_equal(left, right, check_dtype, check_index_type, check_column_type, check_frame_type, check_less_precise, check_names, by_blocks, check_exact, check_datetimelike_compat, check_categorical, check_like, obj)
   1268                 check_datetimelike_compat=check_datetimelike_compat,
   1269                 check_categorical=check_categorical,
-> 1270                 obj='DataFrame.iloc[:, {0}]'.format(i))
   1271 
   1272 

/Users/jreback/pandas/pandas/util/testing.pyc in assert_series_equal(left, right, check_dtype, check_index_type, check_series_type, check_less_precise, check_names, check_exact, check_datetimelike_compat, check_categorical, obj)
   1131         else:
   1132             assert_numpy_array_equal(left.get_values(), right.get_values(),
-> 1133                                      check_dtype=check_dtype)
   1134     else:
   1135         _testing.assert_almost_equal(left.get_values(), right.get_values(),

/Users/jreback/pandas/pandas/util/testing.pyc in assert_numpy_array_equal(left, right, strict_nan, check_dtype, err_msg, obj)
   1034     # compare shape and values
   1035     if not array_equivalent(left, right, strict_nan=strict_nan):
-> 1036         _raise(left, right, err_msg)
   1037 
   1038     if check_dtype:

/Users/jreback/pandas/pandas/util/testing.pyc in _raise(left, right, err_msg)
   1028             msg = '{0} values are different ({1} %)'\
   1029                 .format(obj, np.round(diff, 5))
-> 1030             raise_assert_detail(obj, msg, left, right)
   1031 
   1032         raise AssertionError(err_msg)

/Users/jreback/pandas/pandas/util/testing.pyc in raise_assert_detail(obj, message, left, right)
    983 [left]:  {2}
    984 [right]: {3}""".format(obj, message, left, right)
--> 985     raise AssertionError(msg)
    986 
    987 

AssertionError: numpy array are different

numpy array values are different (50.0 %)
[left]:  [nan, nan, 2015-01-01 00:00:00+00:00, 2016-01-01 00:00:00-05:00]
[right]: [NaT, NaT, 2015-01-01 00:00:00+00:00, 2016-01-01 00:00:00-05:00]

@sinhrks you think we should be checking for this by default? (I forgot why I added this, IIRC maybe for stata tests or something)

sinhrks · 2016-06-13T00:23:04Z

Or we should change assert_numpy_array_equal to use strict_nan=True. Not sure how many tests are relying on this.

jreback · 2016-06-24T10:08:20Z

@hcontrast can you update

jreback · 2016-06-30T10:16:33Z

can you rebase and update

jreback · 2016-07-06T22:10:42Z

can you rebase & update

jreback · 2016-09-09T22:38:59Z

can you rebase / update?

jreback · 2016-11-16T22:22:21Z

closing, but pls reopen / comment if you can continue

jreback · 2016-11-16T22:22:43Z

this is almost there .....

paul-mannino · 2018-01-20T02:10:20Z

Want to knock this one out in conjunction with #18447. Just one question, what's the general desired behavior for missing data in concatenations? Should existing NaTs and nans always remain as they are, and otherwise use whatever the fill value is for the block for any new missing data created by the concatenation?

jreback · 2018-01-20T02:21:15Z

nan is always by type
eg NaT for daetimelile and nan otherwise

the tests look pretty good here

jreback · 2018-01-20T02:21:29Z

go ahead and open a new PR

multiloc mentioned this pull request Feb 20, 2016

Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396

Closed

jreback reviewed Feb 21, 2016
View reviewed changes

jreback changed the title ~~BUG fix GH12396:~~ BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 Feb 22, 2016

jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Feb 22, 2016

jreback added this to the 0.18.0 milestone Feb 22, 2016

jreback reviewed Feb 22, 2016
View reviewed changes

jreback modified the milestones: 0.18.1, 0.18.0 Feb 23, 2016

jreback mentioned this pull request Feb 27, 2016

BUG: fix df concat containing mix of localized and non-localized Timestamps #12462

Closed

jreback modified the milestones: 0.18.0, 0.18.1 Feb 27, 2016

multiloc force-pushed the bugfix_12396 branch from 7c50158 to dd3468a Compare March 23, 2016 21:34

jreback reviewed Mar 23, 2016
View reviewed changes

jreback reviewed May 8, 2016
View reviewed changes

handle tz-aware all NaT concatenation incl. mixed cases (GH12396)

44a0fcf

multiloc force-pushed the bugfix_12396 branch from e5e356d to 44a0fcf Compare May 8, 2016 18:31

jreback reviewed May 8, 2016
View reviewed changes

afragner added 2 commits May 8, 2016 22:39

improve tests for all-NaT concatenation (GH12396)

f95bb83

clean up tests for GH12396

10406df

jreback modified the milestones: Next Major Release, 0.18.2 Jul 6, 2016

jreback closed this Nov 16, 2016

BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 #12403

BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 #12403

Conversation

multiloc commented Feb 20, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

multiloc commented Feb 21, 2016

Choose a reason for hiding this comment

jreback commented Feb 22, 2016

multiloc commented Feb 24, 2016

jreback commented Mar 23, 2016

multiloc commented Mar 23, 2016

Choose a reason for hiding this comment

jreback commented Apr 18, 2016

multiloc commented Apr 18, 2016

jreback commented Apr 18, 2016

multiloc commented Apr 18, 2016

jreback commented Apr 18, 2016

multiloc commented Apr 18, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented May 8, 2016 • edited Loading

Current coverage is 84.14%

jreback commented May 20, 2016

jreback commented May 31, 2016

multiloc commented May 31, 2016

multiloc commented May 31, 2016

jreback commented May 31, 2016

sinhrks commented Jun 13, 2016

jreback commented Jun 24, 2016

jreback commented Jun 30, 2016

jreback commented Jul 6, 2016

jreback commented Sep 9, 2016

jreback commented Nov 16, 2016

jreback commented Nov 16, 2016

paul-mannino commented Jan 20, 2018 • edited Loading

jreback commented Jan 20, 2018

jreback commented Jan 20, 2018

codecov-io commented May 8, 2016 •

edited

Loading

paul-mannino commented Jan 20, 2018 •

edited

Loading